Project 4: Exploratory Data Analysis

Explore and Summarise Data

Data Analyst Nanodegree (Udacity)

Project submission by Edward Minnett (ed@methodic.io).

February 20th 2017 (Revision 2)


Data Source

This report is a ‘stream of consciousness’ exploration of a pair of data sets. One data set represents the physical characteristics and perceived quality of white wine, while the other describes the same features for red wine. The data is provided by Cortez et al. as part of their 2009 paper, Modeling wine preferences by data mining from physicochemical properties, published by Elsevier.

In the authors’ own words:

This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

A more detailed description of the data set can be found in the README.md file that accompanies this report.

Univariate Plots Section

Initial Summary Exploration

When exploring data for the first time, it helps to get a very high level view of the whole data set in the hope of getting an idea where to zoom in and explore in more detail.

To begin with, what is the size and shape of the data set? This summary includes the head of the data frame.

White Wine

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Red Wine

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The data for both red and white wines are described by 12 variables. There are 4898 observations for the white wine, but only 1599 observations for the red wine.

What are the summary statistics for each feature?

White Wine

##                      vars    n   mean    sd median trimmed   mad  min    max  range skew kurtosis   se
## fixed.acidity           1 4898   6.85  0.84   6.80    6.82  0.74 3.80  14.20  10.40 0.65     2.17 0.01
## volatile.acidity        2 4898   0.28  0.10   0.26    0.27  0.09 0.08   1.10   1.02 1.58     5.08 0.00
## citric.acid             3 4898   0.33  0.12   0.32    0.33  0.09 0.00   1.66   1.66 1.28     6.16 0.00
## residual.sugar          4 4898   6.39  5.07   5.20    5.80  5.34 0.60  65.80  65.20 1.08     3.46 0.07
## chlorides               5 4898   0.05  0.02   0.04    0.04  0.01 0.01   0.35   0.34 5.02    37.51 0.00
## free.sulfur.dioxide     6 4898  35.31 17.01  34.00   34.36 16.31 2.00 289.00 287.00 1.41    11.45 0.24
## total.sulfur.dioxide    7 4898 138.36 42.50 134.00  136.96 43.00 9.00 440.00 431.00 0.39     0.57 0.61
## density                 8 4898   0.99  0.00   0.99    0.99  0.00 0.99   1.04   0.05 0.98     9.78 0.00
## pH                      9 4898   3.19  0.15   3.18    3.18  0.15 2.72   3.82   1.10 0.46     0.53 0.00
## sulphates              10 4898   0.49  0.11   0.47    0.48  0.10 0.22   1.08   0.86 0.98     1.59 0.00
## alcohol                11 4898  10.51  1.23  10.40   10.43  1.48 8.00  14.20   6.20 0.49    -0.70 0.02
## quality                12 4898   5.88  0.89   6.00    5.85  1.48 3.00   9.00   6.00 0.16     0.21 0.01

Red Wine

##                      vars    n  mean    sd median trimmed   mad  min    max  range skew kurtosis   se
## fixed.acidity           1 1599  8.32  1.74   7.90    8.15  1.48 4.60  15.90  11.30 0.98     1.12 0.04
## volatile.acidity        2 1599  0.53  0.18   0.52    0.52  0.18 0.12   1.58   1.46 0.67     1.21 0.00
## citric.acid             3 1599  0.27  0.19   0.26    0.26  0.25 0.00   1.00   1.00 0.32    -0.79 0.00
## residual.sugar          4 1599  2.54  1.41   2.20    2.26  0.44 0.90  15.50  14.60 4.53    28.49 0.04
## chlorides               5 1599  0.09  0.05   0.08    0.08  0.01 0.01   0.61   0.60 5.67    41.53 0.00
## free.sulfur.dioxide     6 1599 15.87 10.46  14.00   14.58 10.38 1.00  72.00  71.00 1.25     2.01 0.26
## total.sulfur.dioxide    7 1599 46.47 32.90  38.00   41.84 26.69 6.00 289.00 283.00 1.51     3.79 0.82
## density                 8 1599  1.00  0.00   1.00    1.00  0.00 0.99   1.00   0.01 0.07     0.92 0.00
## pH                      9 1599  3.31  0.15   3.31    3.31  0.15 2.74   4.01   1.27 0.19     0.80 0.00
## sulphates              10 1599  0.66  0.17   0.62    0.64  0.12 0.33   2.00   1.67 2.42    11.66 0.00
## alcohol                11 1599 10.42  1.07  10.20   10.31  1.04 8.40  14.90   6.50 0.86     0.19 0.03
## quality                12 1599  5.64  0.81   6.00    5.59  1.48 3.00   8.00   5.00 0.22     0.29 0.02

We will discuss the relevant statistics in more detail when we take a closer look at the distribution of each feature.
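As a rough guide to the less familiar columns in these tables, the following Python sketch computes a few of them by hand. It assumes the conventions of a typical describe-style summary: mad is the median absolute deviation scaled by 1.4826, skew is the third central moment divided by the cubed sample standard deviation, and se is the standard deviation divided by the square root of n. These conventions are assumptions rather than something stated in this report. The sample values are the first ten white wine alcohol readings listed above, so the results will not match the full-data tables.

```python
import math
import statistics


def scaled_mad(values, constant=1.4826):
    """Median absolute deviation, scaled so it is comparable to sd for normal data."""
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    return constant * statistics.median(deviations)


def skew(values):
    """Third central moment divided by the cubed sample standard deviation (assumed convention)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample sd, n - 1 in the denominator
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / sd ** 3


def standard_error(values):
    """Standard error of the mean: sd / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# First ten white wine alcohol values from the structure summary above.
alcohol_head = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0]

print("mad :", round(scaled_mad(alcohol_head), 2))
print("skew:", round(skew(alcohol_head), 2))
print("se  :", round(standard_error(alcohol_head), 2))
```

A scaled mad of 1.48 for the quality feature in both tables is consistent with this convention, since the unscaled median absolute deviation of the integer scores would be 1.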

Before plotting the features, it is worth doing an analysis to see if there are any obvious outliers that affect the data as a whole. For this analysis, I will be using Cook’s distance based on a linear model for the quality feature for each of the two types of wine. Points with a Cook’s distance greater than 1 will be treated as exerting disproportionate influence on the model.

White Wine

Red Wine

This leaves us with a single outlier within the white wine data. From this point on, this datum will be excluded from the analysis. Any outliers that are present for specific features of the data may be excluded for individual plots (and will be mentioned if this is the case), but those outliers will not be excluded from the dataset when discussing other features.
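To make the Cook’s distance check more concrete, here is a minimal Python sketch. It is not the model used in this report: it fits a simple linear regression with a single predictor so that it needs no external libraries, and the data are made up for illustration. The only detail taken from the report is the threshold of 1.

```python
import statistics


def cooks_distances(x, y):
    """Cook's distance for every point of a simple linear regression y ~ x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_mean = statistics.fmean(x)
    y_mean = statistics.fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point (diagonal of the hat matrix) for one predictor.
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical data: the last point is far from the others and off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 2.0]

distances = cooks_distances(x, y)
flagged = [i for i, d in enumerate(distances) if d > 1]
print("Indices with Cook's distance > 1:", flagged)
```

A point needs both a sizeable residual and high leverage to exceed the threshold, which is why a value that is extreme for only one feature is not necessarily flagged.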

The method of filtering outliers for individual features is based on the interquartile range (IQR). The usual box and whisker plot rule sets lower and upper fences 1.5 times the IQR below the 1st quartile and above the 3rd quartile. When the median sits midway between the two quartiles, this is the same as setting thresholds 2 times the IQR either side of the median (+/-), and it is this median-based form that is used for the filtered plots in this report. For skewed features the two sets of thresholds differ slightly.
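The quartile-based and median-based thresholds can be compared with a short Python sketch. The function names are hypothetical, and the quantile method (linear interpolation over the observed range) is an assumption about how the quartiles in this report were computed. The example uses the quartiles reported for white wine total sulfur dioxide in the univariate section.

```python
import statistics


def quartile_fences(q1, q3, k=1.5):
    """Fences placed k * IQR below the 1st quartile and above the 3rd (box plot rule)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def median_thresholds(q1, median, q3, k=2.0):
    """Thresholds placed k * IQR either side of the median."""
    iqr = q3 - q1
    return median - k * iqr, median + k * iqr


def filter_outliers(values, k=2.0):
    """Keep only the values within k * IQR of the median."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lower, upper = median_thresholds(q1, median, q3, k)
    return [v for v in values if lower <= v <= upper]


# Quartiles reported for white wine total sulfur dioxide: 108, 134, 167.
print(quartile_fences(108, 167))         # (19.5, 255.5)
print(median_thresholds(108, 134, 167))  # (16.0, 252.0)
```

Because the median (134) is a little below the midpoint of the quartiles (137.5), the median-based thresholds sit slightly lower than the box plot fences at both ends.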

White Wines

Now let’s take a look at the general distribution for each of the white wine features. As this is the first time we are looking at each of these distributions, there isn’t a specific justification for each plot apart from the fact that we want to see how that feature is distributed. For these plots, we will include the outliers as it is important to understand how they affect the distributions before they are removed.

## [1] "Summary statistics for White Wine: Fixed Acidity"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4897 6.85 0.84    6.8    6.82 0.74 3.8 14.2  10.4 0.65     2.17 0.01
## [1] "Quantiles for White Wine: Fixed Acidity"
##   0%  25%  50%  75% 100% 
##  3.8  6.3  6.8  7.3 14.2 
## [1] "Interquartile Range for White Wine: Fixed Acidity"
## [1] 1

The majority of the fixed acidity data for white wine is reasonably symmetrically distributed around the mean, with the exception of a few outliers to the right of the distribution.

## [1] "Summary statistics for White Wine: Volatile Acidity"
##    vars    n mean  sd median trimmed  mad  min max range skew kurtosis se
## X1    1 4897 0.28 0.1   0.26    0.27 0.09 0.08 1.1  1.02 1.54      4.8  0
## [1] "Quantiles for White Wine: Volatile Acidity"
##   0%  25%  50%  75% 100% 
## 0.08 0.21 0.26 0.32 1.10 
## [1] "Interquartile Range for White Wine: Volatile Acidity"
## [1] 0.11

Volatile acidity for white wine has a slightly skewed right distribution with quite a few outliers beyond the right tail of the distribution.

## [1] "Summary statistics for White Wine: Citric Acid"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis se
## X1    1 4897 0.33 0.12   0.32    0.33 0.09   0 1.66  1.66 1.28     6.18  0
## [1] "Quantiles for White Wine: Citric Acid"
##   0%  25%  50%  75% 100% 
## 0.00 0.27 0.32 0.39 1.66 
## [1] "Interquartile Range for White Wine: Citric Acid"
## [1] 0.12

Citric acid for white wine has quite a narrow peak and is quite symmetrically distributed around the mean with only a few distant outliers beyond the right tail of the distribution.

## [1] "Summary statistics for White Wine: Residual Sugar"
##    vars    n mean sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4897 6.38  5    5.2     5.8 5.34 0.6 31.6    31 0.79    -0.22 0.07
## [1] "Quantiles for White Wine: Residual Sugar"
##   0%  25%  50%  75% 100% 
##  0.6  1.7  5.2  9.9 31.6 
## [1] "Interquartile Range for White Wine: Residual Sugar"
## [1] 8.2

Residual sugar for white wine is very skewed to the right. There is a very narrow peak close to 0 with a long right-hand tail and a few outliers beyond the tail.

## [1] "Summary statistics for White Wine: Chlorides"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 4897 0.05 0.02   0.04    0.04 0.01 0.01 0.35  0.34 5.02    37.53  0
## [1] "Quantiles for White Wine: Chlorides"
##    0%   25%   50%   75%  100% 
## 0.009 0.036 0.043 0.050 0.346 
## [1] "Interquartile Range for White Wine: Chlorides"
## [1] 0.014

Chlorides for white wine has a very sharp peak just to the left of the mean with very flat tails. The right-hand tail is very long with quite a few outliers beyond the tail.

## [1] "Summary statistics for White Wine: Free Sulfur Dioxide"
##    vars    n  mean sd median trimmed   mad min max range skew kurtosis   se
## X1    1 4897 35.31 17     34   34.36 16.31   2 289   287 1.41    11.46 0.24
## [1] "Quantiles for White Wine: Free Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    2   23   34   46  289 
## [1] "Interquartile Range for White Wine: Free Sulfur Dioxide"
## [1] 23

Free sulfur dioxide for white wine is quite symmetrically distributed around the mean with a slightly longer tail to the right with a few outliers close to the tail and one quite far beyond the right-hand tail.

## [1] "Summary statistics for White Wine: Total Sulfur Dioxide"
##    vars    n   mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 4897 138.36 42.5    134  136.95  43   9 440   431 0.39     0.57 0.61
## [1] "Quantiles for White Wine: Total Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    9  108  134  167  440 
## [1] "Interquartile Range for White Wine: Total Sulfur Dioxide"
## [1] 59

Total sulfur dioxide for white wine has quite a wide peak with a steep slope on the left side of the distribution and a shallower slope on the right-hand side. There are quite a few outliers beyond the right tail of the distribution.

## [1] "Summary statistics for White Wine: Density"
##    vars    n mean sd median trimmed mad  min  max range skew kurtosis se
## X1    1 4897 0.99  0   0.99    0.99   0 0.99 1.01  0.02 0.31     -0.4  0
## [1] "Quantiles for White Wine: Density"
##      0%     25%     50%     75%    100% 
## 0.98711 0.99172 0.99374 0.99610 1.01030 
## [1] "Interquartile Range for White Wine: Density"
## [1] 0.00438

Density for white wine falls within a very narrow range of values. The range is only 0.02319 g / cm^3. This leaves the distribution with a very broad peak and short tails with a couple of outliers beyond the right tail.

## [1] "Summary statistics for White Wine: pH"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 4897 3.19 0.15   3.18    3.18 0.15 2.72 3.82   1.1 0.46     0.53  0
## [1] "Quantiles for White Wine: pH"
##   0%  25%  50%  75% 100% 
## 2.72 3.09 3.18 3.28 3.82 
## [1] "Interquartile Range for White Wine: pH"
## [1] 0.19

The pH feature for white wine appears to be quite symmetrically distributed, with the mean and median taking on almost the same value. The right-hand tail of the distribution is very slightly longer than the left-hand tail.

## [1] "Summary statistics for White Wine: Sulphates"
##    vars    n mean   sd median trimmed mad  min  max range skew kurtosis se
## X1    1 4897 0.49 0.11   0.47    0.48 0.1 0.22 1.08  0.86 0.98     1.59  0
## [1] "Quantiles for White Wine: Sulphates"
##   0%  25%  50%  75% 100% 
## 0.22 0.41 0.47 0.55 1.08 
## [1] "Interquartile Range for White Wine: Sulphates"
## [1] 0.14

Sulphates for white wine lacks a distinct peak in its distribution and might even be bimodal. Like many of the features seen so far, it is skewed to the right with many more outliers beyond the right-hand tail than the left.

## [1] "Summary statistics for White Wine: Alcohol"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4897 10.51 1.23   10.4   10.43 1.48   8 14.2   6.2 0.49     -0.7 0.02
## [1] "Quantiles for White Wine: Alcohol"
##   0%  25%  50%  75% 100% 
##  8.0  9.5 10.4 11.4 14.2 
## [1] "Interquartile Range for White Wine: Alcohol"
## [1] 1.9

Alcohol for white wine is very broadly distributed within its range. It isn’t quite uniformly distributed: there are more data with lower values for alcohol than higher, and there is a peak around the first quartile, but the distribution to the right of the peak falls off far more gradually than it does to the left.

## [1] "Summary statistics for White Wine: Quality"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 4897 5.88 0.89      6    5.85 1.48   3   9     6 0.16     0.21 0.01
## [1] "Quantiles for White Wine: Quality"
##   0%  25%  50%  75% 100% 
##    3    5    6    6    9 
## [1] "Interquartile Range for White Wine: Quality"
## [1] 1

This is the first distribution we have seen that shows discrete values with a very small number of different values. The quality for white wine is represented by an integer and only takes the values between 3 and 9 inclusive. Even though the values are so discrete, the distribution is quite symmetrical with the mean very close to the median.

Let’s also take a look at the kernel density estimate for the quality feature.

Even though this plot is very similar to the histogram, the fact that the scale is normalised will allow us to more directly compare it with the same plot for red wine.
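The normalisation mentioned here can be illustrated with a small Python sketch of a Gaussian kernel density estimate. The bandwidth of 0.5 and the quality scores are made-up values for illustration; the report does not state which bandwidth its plots use.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate at x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical quality scores, not taken from the data set.
scores = [5, 5, 6, 6, 6, 6, 7, 7, 8, 4]
density = gaussian_kde(scores, bandwidth=0.5)

for q in range(3, 10):
    print(q, round(density(q), 3))
```

The estimate integrates to 1 regardless of how many observations there are, which is what makes curves for data sets of different sizes directly comparable.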

Red Wines

Now let’s take a look at the general distribution for each of the red wine features. As this is the first time we are looking at each of these distributions, there isn’t a specific justification for each plot apart from the fact that we want to see how that feature is distributed.

## [1] "Summary statistics for Red Wine: Fixed Acidity"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1599 8.32 1.74    7.9    8.15 1.48 4.6 15.9  11.3 0.98     1.12 0.04
## [1] "Quantiles for Red Wine: Fixed Acidity"
##   0%  25%  50%  75% 100% 
##  4.6  7.1  7.9  9.2 15.9 
## [1] "Interquartile Range for Red Wine: Fixed Acidity"
## [1] 2.1

Fixed acidity for red wine has quite a broad peak with a long stretched out tail to the right. There are a few outliers beyond the right-hand tail.

## [1] "Summary statistics for Red Wine: Volatile Acidity"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1599 0.53 0.18   0.52    0.52 0.18 0.12 1.58  1.46 0.67     1.21  0
## [1] "Quantiles for Red Wine: Volatile Acidity"
##   0%  25%  50%  75% 100% 
## 0.12 0.39 0.52 0.64 1.58 
## [1] "Interquartile Range for Red Wine: Volatile Acidity"
## [1] 0.25

Volatile acidity for red wine has an even broader peak with steep sloping tails. Like many of the plots we have seen for white wine, it is skewed right with many more outliers at the end of and beyond the right-hand tail.

## [1] "Summary statistics for Red Wine: Citric Acid"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis se
## X1    1 1599 0.27 0.19   0.26    0.26 0.25   0   1     1 0.32    -0.79  0
## [1] "Quantiles for Red Wine: Citric Acid"
##   0%  25%  50%  75% 100% 
## 0.00 0.09 0.26 0.42 1.00 
## [1] "Interquartile Range for Red Wine: Citric Acid"
## [1] 0.33

Citric Acid for red wine is quite hard to understand when plotted with the same bin width as the other histograms. Let’s take a closer look at this plot and see if we can clarify the shape of the distribution.

It appears that citric acid for red wine is somewhat uniformly distributed with a few peaks at 0, 0.2, 0.24, and 0.49. The values begin to tail off above 0.5, with no values above 0.79 apart from a few outliers at 1.0.
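The effect of the bin width can be sketched in a few lines of Python. The function below is a hypothetical helper, not the plotting code behind this report, and the bin widths of 0.1 and 0.02 are chosen only for illustration. The values are the first ten red wine citric acid readings from the structure summary above.

```python
import math
from collections import Counter


def bin_counts(values, bin_width, origin=0.0):
    """Count values per bin; keys are the left edges of the non-empty bins."""
    counts = Counter()
    for v in values:
        # The small offset guards against floating point values that land
        # just below a bin edge (for example 0.06 / 0.02).
        index = math.floor((v - origin) / bin_width + 1e-9)
        counts[index] += 1
    return {
        round(origin + i * bin_width, 10): c for i, c in sorted(counts.items())
    }


# First ten red wine citric acid values from the structure summary.
citric_head = [0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36]

coarse = bin_counts(citric_head, bin_width=0.1)
fine = bin_counts(citric_head, bin_width=0.02)
print("width 0.10:", coarse)
print("width 0.02:", fine)
```

With the wider bins, the exact zeros and the small non-zero values fall into the same bin, whereas the narrower bins separate them.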

## [1] "Summary statistics for Red Wine: Residual Sugar"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1599 2.54 1.41    2.2    2.26 0.44 0.9 15.5  14.6 4.53    28.49 0.04
## [1] "Quantiles for Red Wine: Residual Sugar"
##   0%  25%  50%  75% 100% 
##  0.9  1.9  2.2  2.6 15.5 
## [1] "Interquartile Range for Red Wine: Residual Sugar"
## [1] 0.7

Residual sugar for red wine has a very narrow peak below the mean and a very long right-hand tail with many outliers beyond the right-hand tail.

## [1] "Summary statistics for Red Wine: Chlorides"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1599 0.09 0.05   0.08    0.08 0.01 0.01 0.61   0.6 5.67    41.53  0
## [1] "Quantiles for Red Wine: Chlorides"
##    0%   25%   50%   75%  100% 
## 0.012 0.070 0.079 0.090 0.611 
## [1] "Interquartile Range for Red Wine: Chlorides"
## [1] 0.02

Chlorides for red wine has a distribution very similar to that of residual sugar. It has a very narrow peak below the mean and a very long right-hand tail with many outliers beyond the right-hand tail.

## [1] "Summary statistics for Red Wine: Free Sulfur Dioxide"
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 1599 15.87 10.46     14   14.58 10.38   1  72    71 1.25     2.01 0.26
## [1] "Quantiles for Red Wine: Free Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    1    7   14   21   72 
## [1] "Interquartile Range for Red Wine: Free Sulfur Dioxide"
## [1] 14

Free sulfur dioxide for red wine has a peak very far to the left with a steep left-hand tail and very long and gradual right-hand tail. There are quite a few outliers at the end of and beyond the right-hand tail.

## [1] "Summary statistics for Red Wine: Total Sulfur Dioxide"
##    vars    n  mean   sd median trimmed   mad min max range skew kurtosis   se
## X1    1 1599 46.47 32.9     38   41.84 26.69   6 289   283 1.51     3.79 0.82
## [1] "Quantiles for Red Wine: Total Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    6   22   38   62  289 
## [1] "Interquartile Range for Red Wine: Total Sulfur Dioxide"
## [1] 40

Total sulfur dioxide for red wine has a very similar shape to free sulfur dioxide for red wine. The distribution has a peak very far to the left with a steep left-hand tail and very long and gradual right-hand tail though this right-hand tail is a little less gradual than the one for free sulfur dioxide. There are fewer outliers beyond the right-hand tail.

## [1] "Summary statistics for Red Wine: Density"
##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 1599    1  0      1       1   0 0.99   1  0.01 0.07     0.92  0
## [1] "Quantiles for Red Wine: Density"
##       0%      25%      50%      75%     100% 
## 0.990070 0.995600 0.996750 0.997835 1.003690 
## [1] "Interquartile Range for Red Wine: Density"
## [1] 0.002235

Density for red wine is very symmetrically distributed with a mean and median of the same value and a skew statistic close to 0.

## [1] "Summary statistics for Red Wine: pH"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1599 3.31 0.15   3.31    3.31 0.15 2.74 4.01  1.27 0.19      0.8  0
## [1] "Quantiles for Red Wine: pH"
##   0%  25%  50%  75% 100% 
## 2.74 3.21 3.31 3.40 4.01 
## [1] "Interquartile Range for Red Wine: pH"
## [1] 0.19

pH for red wine is also very symmetrically distributed with a mean and median of the same value though its skew statistic is slightly larger than that of density for red wine. There are a few outliers beyond both tails.

## [1] "Summary statistics for Red Wine: Sulphates"
##    vars    n mean   sd median trimmed  mad  min max range skew kurtosis se
## X1    1 1599 0.66 0.17   0.62    0.64 0.12 0.33   2  1.67 2.42    11.66  0
## [1] "Quantiles for Red Wine: Sulphates"
##   0%  25%  50%  75% 100% 
## 0.33 0.55 0.62 0.73 2.00 
## [1] "Interquartile Range for Red Wine: Sulphates"
## [1] 0.18

Sulphates for red wine is skewed to the right with a peak just to the left of the mean. There are quite a few outliers beyond the right-hand tail of the distribution.

## [1] "Summary statistics for Red Wine: Alcohol"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1599 10.42 1.07   10.2   10.31 1.04 8.4 14.9   6.5 0.86     0.19 0.03
## [1] "Quantiles for Red Wine: Alcohol"
##   0%  25%  50%  75% 100% 
##  8.4  9.5 10.2 11.1 14.9 
## [1] "Interquartile Range for Red Wine: Alcohol"
## [1] 1.6

Alcohol for red wine is very skewed to the right with a peak near the first quartile. It isn’t quite as uniformly distributed as the distribution for alcohol for white wine, but its right-hand tail is still quite gradual with an outlier beyond the end of the right-hand tail.

## [1] "Summary statistics for Red Wine: Quality"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1599 5.64 0.81      6    5.59 1.48   3   8     5 0.22     0.29 0.02
## [1] "Quantiles for Red Wine: Quality"
##   0%  25%  50%  75% 100% 
##    3    5    6    6    8 
## [1] "Interquartile Range for Red Wine: Quality"
## [1] 1

Like the quality distribution for white wine, this is the only distribution we have seen for red wine that shows discrete values with a very small number of different values. The quality for red wine is represented by an integer and only takes the values between 3 and 8 inclusive (one fewer value than white wine). Even though the values are so discrete, the distribution is quite symmetrical with the mean a little over half way between the two middle values. There are more data that have higher quality scores than lower.

Let’s also take a look at the kernel density estimate for the quality feature.

Even though this plot is very similar to the histogram, the fact that the scale is normalised will allow us to more directly compare it with the same plot for white wine. Though the distribution for both is quite symmetrical, the distribution for red wine has a slightly broader peak than that for white wine.

Univariate Analysis

What is the structure of your dataset?

The two datasets contain a total of 6497 observations, with 4898 for white wines and 1599 for reds. Each observation is described by 12 variables (this description of the variables comes from the data description authored by Cortez et al.):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)
  12. quality (score between 0 and 10). This is the output variable (based on sensory data).

All 12 variables are numeric. Only quality is stored as an integer. The other 11 variables are stored as floating point numbers, although free sulfur dioxide and total sulfur dioxide mostly take whole-number values.

The histograms for each of the 12 features for both data sets give us a good indication of the distributions. Nearly all of the histograms are skewed to the right with more outliers in the right-hand tails. The notable exceptions are pH, which appears to be reasonably normally distributed, and density for red wine, which does too. The quality histograms immediately stand out because that feature only contains integer values, between 3 and 9 for white wines and between 3 and 8 for reds. The disparity between the number of observations for whites compared to reds becomes very clear. Some of the plots for the red wines are much less clearly defined because there are far fewer observations. In particular, the citric acid plot for red wine has a distinct lack of structure in the distribution.

What is/are the main feature(s) of interest in your dataset?

There isn’t a particular feature of interest in the data that stands out apart from the quality for each type of wine. What I am interested in finding out is whether there are any distinct differences in the physical characteristics between white and red wine. Just as importantly, I would like to know if there is a strong correlation between any of the physical characteristics of the wine and the perceived quality of that wine. If there are, are these physical qualities different for white and red wines?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Only further investigation will determine if this is true, but I have a feeling that extreme values in the physical characteristics of a wine will negatively impact the quality. This is merely a conjecture, but I think if the acidity is too low or too high, or the sulfur dioxide is too low or too high, this is likely to lead to particularly low scoring wines. If this is true, I imagine that wines that tend to reside near the peaks for each of the physical characteristics will have above average quality scores.

Did you create any new variables from existing variables in the dataset?

No. I couldn’t think of a characteristic of the data that needed to be described by a new variable composed of the existing variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The data is already tidy, so further tidying wasn’t needed.

I performed outlier analysis using Cook’s distance based on a linear model for the quality feature for each of the two types of wine with a threshold of 1. This analysis found a single outlier in the white wine data. It has been removed for all subsequent analysis. Some of the individual features have clear outliers in their distributions, but because the Cook’s distance didn’t flag them as having disproportional influence on the data as a whole, these records are unlikely to be outliers for more than a few of the features. Removing them from the data would be a mistake. Instead, these feature specific outliers may be excluded from individual plots when doing further analysis, but remain within the data as a whole.

Of all 24 distributions, only the citric acid observations for red wine required further analysis. This is primarily because the distribution wasn’t clear when plotted with the same bin width as the other histograms. Even when the bin width was decreased to get a better sense of the distribution’s shape, the distribution still lacked a coherent unimodal or even multi-modal shape.

Bivariate Plots Section

To begin the bivariate analysis of each type of wine, let’s take a look at the correlations for each of the features. It will be useful to know which pairs of features have the strongest correlations and which features are most strongly correlated with the quality feature.

Correlations

White Wine Feature Correlations

The White Wine features don’t appear to be very highly correlated with each other. There are only three pairs of features with a Pearson product-moment correlation coefficient (r) score greater than 0.5 and only 1 pair with a score less than -0.5.

  • Density / Residual Sugar: 0.83
  • Total Sulfur Dioxide / Free Sulfur Dioxide: 0.62
  • Density / Total Sulfur Dioxide: 0.54
  • Density / Alcohol: -0.8
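For reference, the Pearson coefficient and the 0.5 cut-off used above can be sketched in Python as follows. The helper names are hypothetical, and the columns hold only the first ten white wine rows from the structure summary, so the coefficients printed here will differ from the full-data values listed above.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.5):
    """Return (name_a, name_b, r) for every pair of columns with |r| > threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# First ten white wine rows only, copied from the structure summary.
white_head = {
    "residual.sugar": [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5],
    "pH": [3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22],
    "alcohol": [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11],
}

for name_a, name_b, r in strong_pairs(white_head):
    print(f"{name_a} / {name_b}: {r:.2f}")
```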

Red Wine Feature Correlations

The Red Wine features don’t appear to be very highly correlated with each other either. Like the White Wines, there are only three pairs of features with a Pearson product-moment correlation coefficient (r) score greater than 0.5, but there are four pairs with a score less than or equal to -0.5. Interestingly, there are only two that overlap with the White Wines.

  • Density / Fixed Acidity: 0.67
  • Citric Acid / Fixed Acidity: 0.67
  • Total Sulfur Dioxide / Free Sulfur Dioxide: 0.67
  • Density / Alcohol: -0.5
  • Citric Acid / pH: -0.54
  • Citric Acid / Volatile Acidity: -0.55
  • pH / Fixed Acidity: -0.68

The largest correlation scores with the quality of each of the types of wine are with the alcohol content (r 0.44 for White Wine and r 0.48 for Red Wines). These aren’t particularly large scores, and they likely shed more light on how some of the reviewers providing the quality scores prefer stronger drinks than on the wine itself.

Now we have isolated the features that have the strongest relationships with each other. Let’s take a closer look at each of these 11 bivariate distributions and their relationships.

White Wines

What do the four strongest relationships in the white wines data look like (from the strongest positive correlation to the strongest negative)?

Density / Residual Sugar: r 0.83

This is the strongest relationship between any two features for either white or red wine. Let’s see what it looks like.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Density: 4"
## [1] "Additional data considered outliers for Residual Sugar: 4"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Density (excluding outliers)"
##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 4889 0.99  0   0.99    0.99   0 0.99   1  0.02 0.24    -0.77  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Residual Sugar (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4889 6.35 4.94    5.2    5.79 5.34 0.6 20.8  20.2 0.73    -0.53 0.07

Only 8 outliers have been excluded (4 for each feature) from this plot. The relationship is very clear. If there weren’t so many data with very low residual sugar values I suspect the relationship would be more linear. These low values are clearly pulling the trend line downward for lower density wines. The plot reinforces the positive correlation between the two features and shows that high density wines tend to have more residual sugar. Perhaps wine density is primarily a product of the amount of residual sugar. We can’t be certain of this though.

Total Sulfur Dioxide / Free Sulfur Dioxide: r 0.62

The relationship between total sulfur dioxide and free sulfur dioxide is the next strongest positive relationship for white wine. We know they are positively correlated, but the lower correlation score suggests that the scatter plot will be more spread out than that of density and residual sugar. Let’s see what it looks like.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Total Sulfur Dioxide: 22"
## [1] "Additional data considered outliers for Free Sulfur Dioxide: 46"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Total Sulfur Dioxide (excluding outliers)"
##    vars    n  mean    sd median trimmed mad min max range skew kurtosis   se
## X1    1 4829 137.3 41.11    133  136.18  43  18 251   233 0.22    -0.32 0.59
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Free Sulfur Dioxide (excluding outliers)"
##    vars    n  mean    sd median trimmed   mad min  max range skew kurtosis   se
## X1    1 4829 34.61 15.38     33   34.05 16.31   2 79.5  77.5 0.32    -0.43 0.22

The distribution of points is indeed more broadly scattered, but the trend line has a more consistent shape. The distribution of points broadens as total sulfur dioxide increases, but there are also more examples in the data for total sulfur dioxide levels to the right of the distribution than to the left.

We have excluded 68 outliers in this plot, with roughly twice as many outliers for free sulfur dioxide as for total sulfur dioxide (46 for free and 22 for total). As the correlation suggests, the relationship between the two features is positive.

Density / Total Sulfur Dioxide: r 0.54

The relationship between density and total sulfur dioxide is the last one for white wine with a positive correlation greater than 0.5. Let’s see what the plot looks like.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Density: 4"
## [1] "Additional data considered outliers for Total Sulfur Dioxide: 22"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Density (excluding outliers)"
##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 4871 0.99  0   0.99    0.99   0 0.99   1  0.02 0.25    -0.76  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Total Sulfur Dioxide (excluding outliers)"
##    vars    n   mean    sd median trimmed mad min max range skew kurtosis   se
## X1    1 4871 137.79 41.33    134  136.67  43  18 251   233 0.22    -0.34 0.59

Only 26 outliers have been excluded from this plot. As we have already seen these features plotted against other features, it isn’t a surprise that the same numbers of outliers have been excluded (4 for density and 22 for total sulfur dioxide). The trend line for these two features is much less consistent in its shape. The overall trend is positive, but there is a distinct ‘wobble’ in the middle of the data that makes it hard to draw any conclusions about the slope of the trend. Even though the trend is generally positive, the spread of the distribution is considerable.

Density / Alcohol: r -0.8

The relationship between density and alcohol for white wine is the second strongest for all of the data and the strongest negative relationship. Let’s see what the plot looks like.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Density: 4"
## [1] "Additional data considered outliers for Alcohol: 0"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Density (excluding outliers)"
##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 4893 0.99  0   0.99    0.99   0 0.99   1  0.02 0.25    -0.76  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4893 10.51 1.23   10.4   10.43 1.48   8 14.2   6.2 0.49     -0.7 0.02

With only 4 outliers for density and no outliers for alcohol, we can be quite confident about the shape of this plot. Even though the trend slope isn’t consistent for the whole distribution, there is a distinct relationship. Alcohol decreases as density increases and the relationship between alcohol and density is more prominent for low measurements of wine density. Once density increases above roughly 0.993 g / cm^3, the relationship flattens out and all the wines tend to have a medium to low amount of alcohol.
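One rough way to check a change in slope like this is to fit a separate straight line on each side of the suspected break point and compare the two slopes. The sketch below does that with ordinary least squares in plain Python. It is a cruder check than the smoothed trend line in the plot, the function names are hypothetical, and the density and alcohol values are invented to show the idea rather than taken from the wine data.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slopes_around(xs, ys, break_point):
    """Fit one line at or below break_point and another above it."""
    left = [(x, y) for x, y in zip(xs, ys) if x <= break_point]
    right = [(x, y) for x, y in zip(xs, ys) if x > break_point]
    return ols_slope(*zip(*left)), ols_slope(*zip(*right))


# Invented values: a steep decline that flattens above 0.993.
density = [0.989, 0.990, 0.991, 0.992, 0.993, 0.994, 0.996, 0.998, 1.000]
alcohol = [13.0, 12.4, 11.8, 11.2, 10.6, 10.4, 10.2, 10.0, 9.8]
left_slope, right_slope = slopes_around(density, alcohol, 0.993)
print(round(left_slope), round(right_slope))
```

A left-hand slope that is much steeper than the right-hand one is what "the relationship flattens out" looks like numerically.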

Red Wines

What do the seven strongest relationships in the red wines data look like (from the strongest positive correlation to the strongest negative)?

There are three pairs of features that all have a positive correlation of r 0.67. These three are the only pairs of features for red wine that have a positive correlation greater than 0.5.

Let’s take a look at the first of these.

Density / Fixed Acidity: r 0.67

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Density: 45"
## [1] "Additional data considered outliers for Fixed Acidity: 45"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Density (excluding outliers)"
##    vars    n mean sd median trimmed mad  min max range  skew kurtosis se
## X1    1 1509    1  0      1       1   0 0.99   1  0.01 -0.04    -0.07  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Fixed Acidity (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1509 8.16 1.46    7.8    8.05 1.33 4.6  12   7.4 0.61    -0.14 0.04

For this plot we have excluded 90 outliers (45 each for density and fixed acidity). We have far fewer data for red wine so it is not a surprise that the plots are more scattered and the correlation scores are lower. Despite this limitation, the trend between density and fixed acidity for red wine is positive and consistent. The slope of the trend does not change very much as both density and fixed acidity increase.

Citric Acid / Fixed Acidity: r 0.67

The next of the three relationships is for citric acid and fixed acidity. As both values measure a quantity of acidity, it isn’t a surprise that the correlation is relatively high and positive.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Citric Acid: 1"
## [1] "Additional data considered outliers for Fixed Acidity: 59"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Citric Acid (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis se
## X1    1 1539 0.26 0.19   0.25    0.25 0.24   0 0.78  0.78 0.32    -0.84  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Fixed Acidity (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1539 8.14 1.48    7.8    8.04 1.33 4.6  12   7.4 0.55    -0.17 0.04

There is only 1 outlier for citric acid but 59 for fixed acidity. The points are quite scattered on either side of the trend line, but the degree of scattering is consistent for the range of citric acid values. The trend is mostly positive, but flattens for data with high levels of citric acid. The confidence interval for this part of the trend line is also wider as there are proportionally fewer data with citric acid levels higher than 0.6 g / dm^3.

Total Sulfur Dioxide / Free Sulfur Dioxide: r 0.67

The last of the three relationships with a correlation score of 0.67 is between total sulfur dioxide and free sulfur dioxide. Let’s see what the plot looks like.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Total Sulfur Dioxide: 71"
## [1] "Additional data considered outliers for Free Sulfur Dioxide: 42"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Total Sulfur Dioxide (excluding outliers)"
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 1486 41.21 25.67     35   38.14 23.72   6 116   110 0.91     0.05 0.67
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Free Sulfur Dioxide (excluding outliers)"
##    vars    n  mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 1486 14.32 8.48     13   13.44 8.9   1  38    37 0.76    -0.23 0.22

Of all the plots we have seen so far, this one has the largest number of outliers at 113 (71 for total sulfur dioxide and 42 for free sulfur dioxide). The relationship between these two features is positive and very clear for data with total sulfur dioxide less than 50 mg / dm^3. The spread of free sulfur dioxide increases as total sulfur dioxide increases, resulting in a cone of data centred around the trend line. The density of the plot decreases as total sulfur dioxide increases and for data with total sulfur dioxide levels above 60 mg / dm^3 there doesn’t appear to be any relationship between the features.
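The cone shape described here can be quantified by measuring the spread of one feature within successive bins of the other. The following is a minimal sketch in plain Python, with a hypothetical function name, bin edges chosen only for illustration, and invented values rather than the red wine data.

```python
from statistics import pstdev


def spread_by_bin(xs, ys, edges):
    """Standard deviation of y within each [edges[i], edges[i+1]) bin of x."""
    spreads = []
    for low, high in zip(edges, edges[1:]):
        in_bin = [y for x, y in zip(xs, ys) if low <= x < high]
        # A single point (or none) has no meaningful spread.
        spreads.append(pstdev(in_bin) if len(in_bin) > 1 else None)
    return spreads


# Invented values: free sulfur dioxide fans out as total sulfur dioxide grows.
total_so2 = [10, 12, 15, 18, 30, 34, 38, 42, 70, 80, 90, 100]
free_so2 = [4, 5, 6, 7, 8, 12, 16, 20, 10, 20, 30, 40]
spreads = spread_by_bin(total_so2, free_so2, edges=[0, 25, 50, 125])
print([round(s, 2) for s in spreads])
```

A spread that grows from bin to bin corresponds to the widening cone, and it is one reason a single correlation score can understate how well the features track each other at low values.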

Density / Alcohol: r -0.5

The relationship between density and alcohol is the first for red wine with a negative correlation, and it is the weakest of the negative correlations we are investigating in this part of the report.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Density: 45"
## [1] "Additional data considered outliers for Alcohol: 12"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Density (excluding outliers)"
##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 1542    1  0      1       1   0 0.99   1  0.01 0.03    -0.13  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1542 10.36 0.98   10.1   10.27 1.04 8.4  13   4.6 0.69    -0.41 0.03

We have excluded a total of 57 outliers in this plot (45 for density and 12 for alcohol). The shape of this relationship is very similar to that for white wine. The level of alcohol decreases as density increases up to a point and then the trend flattens out. Even though the shape is very similar, the point in the trend line where the slope changes occurs at a higher density than for white wine (roughly 0.993 g / cm^3 for white wine and roughly 0.997 g / cm^3 for red wine). This may appear very subtle when comparing values, but with such a narrow range of density values, the difference is noticeable.

Citric Acid / pH: r -0.54

It isn’t a surprise that there is a negative correlation between citric acid and pH as pH is a measurement of acidity. Let’s take a look at the plot and see if the trend is consistent.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Citric Acid: 1"
## [1] "Additional data considered outliers for pH: 37"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Citric Acid (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis se
## X1    1 1561 0.27 0.19   0.26    0.26 0.24   0 0.79  0.79 0.28    -0.89  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for pH (excluding outliers)"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1561 3.31 0.14   3.31    3.31 0.13 2.94 3.68  0.74 0.05    -0.27  0

The trend is consistently negative with a minor ‘wobble’ in the middle of the data. I think this abnormality is more likely to be an artefact of having so few data for red wine than a true relationship in the physical characteristics of the wine. For this plot we excluded 1 outlier for citric acid and 37 for pH.

Citric Acid / Volatile Acidity: r -0.55

For citric acid and volatile acidity, we have another relationship between two measurements of acidity for red wine. As it is the second strongest negative relationship for red wine, let’s see what the plot looks like.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Citric Acid: 1"
## [1] "Additional data considered outliers for Volatile Acidity: 19"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Citric Acid (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis se
## X1    1 1579 0.27 0.19   0.26    0.26 0.24   0 0.79  0.79 0.29    -0.88  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Volatile Acidity (excluding outliers)"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1579 0.52 0.17   0.52    0.51 0.18 0.12 1.01  0.89 0.29     -0.3  0

Surprisingly, the relationship between these two measurements of acidity is not consistent. The trend is negative for values of citric acid less than 0.4 g / dm^3 and flattens out for data with a higher level of citric acid. There is less data in the area of the plot to the right of the point of deflection, so this could be a confounding factor for the inconsistency in the trend line.

pH / Fixed Acidity: r -0.68

The strongest relationship between features for the red wine data is between pH and fixed acidity. Again, these are two measures of acidity so the high degree of correlation is to be expected. Let’s take a look at the plot.

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for pH: 38"
## [1] "Additional data considered outliers for Fixed Acidity: 54"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for pH (excluding outliers)"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1507 3.32 0.14   3.32    3.32 0.13 2.94 3.68  0.74 0.07     -0.2  0
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Fixed Acidity (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1507 8.16 1.45    7.9    8.06 1.33   5  12     7 0.62    -0.18 0.04

Unlike the previous two plots, the trend line for these two features is quite consistent and shows that fixed acidity decreases as pH increases. We have ignored 38 outliers for pH and 54 for fixed acidity.

Quality

We have seen that alcohol has the strongest relationship with quality for both types of wine. Let’s take a look at what these two distributions look like.

From this plot, we can see that alcohol has a very slight negative relationship with the quality score below a score of 5, but it has a much stronger positive relationship for scores greater than 5. The score with the most outliers is 5 with all of the outliers representing alcohol levels above the interquartile range.

The relationship between alcohol and quality for red wine is very similar to that of white wine following almost the same trends. The main difference, as has been noted previously, is that there are wines with a quality score of 9 for white wines but none for red wines.
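The comparison these plots make amounts to summarising alcohol separately for each quality score. A minimal sketch of that grouping in plain Python is shown below. The function name is hypothetical and the scores and alcohol levels are invented to mimic the dip at a score of 5, not taken from either data set.

```python
from collections import defaultdict
from statistics import median


def median_by_group(groups, values):
    """Median of values for each distinct group label, in label order."""
    buckets = defaultdict(list)
    for group, value in zip(groups, values):
        buckets[group].append(value)
    return {group: median(vals) for group, vals in sorted(buckets.items())}


# Invented scores and alcohol levels (% by volume).
quality = [3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8]
alcohol = [10.4, 10.6, 10.2, 10.4, 9.6, 9.8, 10.0, 10.4, 10.8, 11.2, 11.6, 12.0, 12.4]
medians = median_by_group(quality, alcohol)
for score, value in medians.items():
    print(score, value)
```

Reading the medians in order of score makes the two regimes easy to see: a slight decline up to a score of 5, then a steady rise.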

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Unfortunately, there doesn’t appear to be a strong relationship between the quality of the wine and any of its physical characteristics. The strongest of these relationships is the one between alcohol and quality, though the correlation r score is only 0.44 for white wine and 0.48 for red. This suggests that there may be a bias for some of the reviewers toward a stronger wine appearing to be of higher quality, though this isn’t something that can be confirmed with the data available (as we can’t determine which reviewer scored which wine). When tasting wine, the quantity of alcohol is one of the least subtle qualities to detect. It is possible that reviewers latched onto this quality to differentiate their preference for the different wines.

The fact that there isn’t a clear relationship between quality and any specific physical characteristic does tell us something about how wine is perceived. It is plausible that the physical character of the wine isn’t the primary predictor of quality. There are likely to be other confounding factors not captured by this data. These could include the colour, context, shape of the glass, or whether the wine was perceived to be cheap or expensive.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It isn’t surprising that the strongest bivariate relationships in the data follow physical relationships in the chemistry and fermentation of wine. The strong relationship between density and residual sugar suggests that the density of the wine is highly influenced by the measure of the residual sugar. Given that the fermentation process causes sugar to turn into alcohol, it makes sense that there is an inverse relationship that is nearly as strong between density and alcohol. There is also a reasonably strong relationship between total sulfur dioxide and free sulfur dioxide (r 0.62 for white wine and 0.67 for red wine). For red wine, there is a strong positive relationship between citric acid and fixed acidity but a moderately strong inverse relationship between citric acid and volatile acidity along with reasonably strong negative relationships between pH and citric acid and pH and fixed acidity. This makes sense as all of these chemical properties are associated with each other.

What is interesting is that the relationships between the acidic features are less prominent for white wine than they are for red wine.

What was the strongest relationship you found?

The strongest relationship I found was between density and residual sugar for white wine. These two features had the largest Pearson r score of 0.83. This was closely followed by a negative correlation between alcohol and density (also for white wine) with an r score of -0.8.

Multivariate Plots Section

We have established that the relationships between quality and the physical features of the wine aren’t very strong, but they do exist. Not only is there some correlation between the quality and physical features, but the nature of these relationships differs for red and white wines. We have also established that the strongest relationship is between alcohol and quality. Now let’s take the next three strongest relationships for each of the two types of wine, plot them against alcohol and colour the plot by the quality for each datum.

Let’s see what we can find.

White Wines

For white wines the four strongest relationships with quality are:

  • Alcohol (r 0.44)
  • Density (r -0.31)
  • Chlorides (r -0.21)
  • Total Sulfur Dioxide (r -0.17)
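"Strongest" in the list above means largest in magnitude, regardless of sign. The short Python sketch below makes that ordering explicit using only the four r scores quoted in the list. The dictionary name is hypothetical.

```python
# Correlation of each feature with quality for white wine, as listed above.
quality_r = {
    "total sulfur dioxide": -0.17,
    "alcohol": 0.44,
    "chlorides": -0.21,
    "density": -0.31,
}

# Rank by absolute value so that strong negative relationships are not
# pushed below weak positive ones.
ranked = sorted(quality_r.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: r {r:+.2f}")
```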

The three resulting plots are as follows:

Alcohol vs Density by Quality

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Alcohol: 0"
## [1] "Additional data considered outliers for Density: 4"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4893 10.51 1.23   10.4   10.43 1.48   8 14.2   6.2 0.49     -0.7 0.02
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Density (excluding outliers)"
##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 4893 0.99  0   0.99    0.99   0 0.99   1  0.02 0.25    -0.76  0

From this plot, we can see that not only does density decrease as alcohol increases, but the quality of the wine also increases as density decreases and alcohol increases. As the correlation scores suggest, the relationship between alcohol and quality is stronger than that of density and quality.

The density of the scatter plot makes this a little difficult to read so let’s try plotting this slightly differently to see if breaking the quality into facets shows us anything new.

As we saw in the previous plot, the trend between alcohol and quality is clearer than the trend between density and quality. This faceted plot clearly shows that the negative relationship between density and alcohol is pretty consistent for all levels of quality.
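A numerical counterpart to reading the facets by eye is to compute the correlation between density and alcohol separately within each quality score. The sketch below does this in plain Python. The function names are hypothetical and the nine rows are invented for illustration, so the printed values say nothing about the real data.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def r_by_facet(facets, xs, ys):
    """Correlation of xs and ys computed separately for each facet label."""
    groups = defaultdict(lambda: ([], []))
    for facet, x, y in zip(facets, xs, ys):
        groups[facet][0].append(x)
        groups[facet][1].append(y)
    return {f: pearson_r(gx, gy) for f, (gx, gy) in sorted(groups.items())}


# Invented rows: three wines for each of three quality scores.
quality = [5, 5, 5, 6, 6, 6, 7, 7, 7]
density = [0.994, 0.996, 0.998, 0.992, 0.994, 0.996, 0.990, 0.992, 0.994]
alcohol = [10.0, 9.6, 9.2, 11.0, 10.5, 10.1, 12.4, 11.8, 11.3]
per_quality = r_by_facet(quality, density, alcohol)
for score, r in per_quality.items():
    print(score, round(r, 2))
```

If the sign and rough size of r are similar in every facet, the relationship is consistent across quality levels in the sense described above.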

Alcohol vs Chlorides by Quality

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Alcohol: 0"
## [1] "Additional data considered outliers for Chlorides: 207"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4690 10.55 1.23   10.4   10.48 1.33   8 14.2   6.2 0.45    -0.72 0.02
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Chlorides (excluding outliers)"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 4690 0.04 0.01   0.04    0.04 0.01 0.02 0.07  0.06 0.11    -0.22  0

This plot is quite broadly scattered and the relationship between chlorides and quality is far less distinct than it is for alcohol and quality.

207 outliers are quite a few and they are all for the chlorides feature. Let’s take a look to see what the plot looks like when we include these outliers.

When we include the 207 outliers we can see that the large majority of them occur when alcohol levels are between 9% and 10% by volume. They also appear to represent data with lower quality wines. When considering the outliers this does mean that the quality of the wine increases as chlorides decrease but this could simply be an expression of the trend between alcohol and quality.

Alcohol vs Total Sulfur Dioxide by Quality

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Alcohol: 0"
## [1] "Additional data considered outliers for Total Sulfur Dioxide: 22"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4875 10.52 1.23   10.4   10.44 1.48   8 14.2   6.2 0.48    -0.71 0.02
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Total Sulfur Dioxide (excluding outliers)"
##    vars    n  mean    sd median trimmed mad min max range skew kurtosis   se
## X1    1 4875 137.8 41.32    134  136.68  43  18 251   233 0.22    -0.34 0.59

As was the case for the multivariate plot for chlorides above (without the outliers), this plot is quite broadly scattered and the relationship between total sulfur dioxide and quality is far less distinct than it is for alcohol and quality. This plot excluded 22 outliers, all of which were for total sulfur dioxide.

Red Wines

For red wines the four strongest relationships with quality are:

  • Alcohol (r 0.48)
  • Volatile Acidity (r -0.39)
  • Sulphates (r 0.25)
  • Citric Acid (r 0.23)

The three resulting plots are as follows:

Alcohol vs Volatile Acidity by Quality

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Alcohol: 17"
## [1] "Additional data considered outliers for Volatile Acidity: 19"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1563 10.39 1.01   10.1   10.29 1.04 8.4 13.3   4.9 0.72    -0.37 0.03
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Volatile Acidity (excluding outliers)"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1563 0.52 0.17   0.52    0.52 0.18 0.12 1.01  0.89 0.27    -0.31  0

Even though few of the points in the plot overlap or obscure the relationships between the features, it is still hard to differentiate between the data for wines with qualities of 6, 7 and 8. There is a clear trend for the lower quality wines to have less alcohol, but the trend for volatile acidity is much harder to read.

Let’s try plotting this slightly differently to see if breaking the quality into facets shows us anything new.

This plot shows the relationship between alcohol and quality scores much more clearly. There is also a very slight trend for wines with less volatile acidity to be higher quality. The band of wines with a quality score of 5 shows a distinct bias toward lower alcohol levels across most of the range of volatile acidity.

Alcohol vs Sulphates by Quality

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Alcohol: 17"
## [1] "Additional data considered outliers for Sulphates: 64"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1518 10.4 1.01   10.2   10.31 1.04 8.4 13.3   4.9 0.69     -0.4 0.03
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Sulphates (excluding outliers)"
##    vars    n mean   sd median trimmed mad  min  max range skew kurtosis se
## X1    1 1518 0.63 0.12   0.61    0.63 0.1 0.33 0.97  0.64 0.57    -0.15  0

Again, this plot shows that red wines with alcohol levels higher than 10% by volume tend to be higher quality, but the relationship with sulphates is less clear. There appears to be more data toward the lower range of sulphates with lower quality scores but the same trend is hard to discern for higher quality wines. This plot excludes 17 outliers for alcohol and 64 outliers for sulphates.

Alcohol vs Citric Acid by Quality

## [1] "Outliers have been filtered out that lie outside of 2 times the IQR from the median (+/-)."
## [1] "Number of data considered outliers for Alcohol: 17"
## [1] "Additional data considered outliers for Citric Acid: 1"
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Alcohol (excluding outliers)"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1581 10.39 1.01   10.1   10.29 1.04 8.4 13.3   4.9 0.71    -0.37 0.03
## [1] "---------------------------------------------------------------------"
## [1] "Summary statistics for Citric Acid (excluding outliers)"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis se
## X1    1 1581 0.27 0.19   0.26    0.26 0.24   0 0.79  0.79 0.29    -0.87  0

This plot for citric acid vs alcohol by quality is very similar to the previous plot for sulphates. The relationship between alcohol and quality is much more discernible than the relationship between citric acid and quality. This plot excludes 17 outliers for alcohol and 1 outlier for citric acid.

Of all these plots, the plot of the relationship between citric acid and alcohol for red wines is the most spread out. Citric acid is most evenly distributed for wines with lower levels of alcohol whereas higher alcohol wines tend to have citric acid levels below 0.125 or between 0.3 and 0.7 g / dm^3.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

By looking at the relationships between alcohol (the feature with the strongest relationship with quality) and features with the 2nd through 4th strongest relationships with quality for both white and red wines, we begin to get a sense of what influences the quality score for a given wine. The correlation matrices for each type of wine showed us that these relationships differ for white and red wines. The multivariate plots tell a subtle story. The strength of the relationship between alcohol and quality (common for all 9 plots explored) is quite clear. It is the relationship between quality and the other six features that is far more subtle.

Were there any interesting or surprising interactions between features?

When plotting chlorides and alcohol by quality for white wines, there is a distinct spike in the variability of a wine’s chloride levels when its alcohol is between 8.5 and 10. This shape was only visible when the outliers were included in the plot. For these wines that have elevated chlorides, their quality tends to be low or medium (between 3 and 6 out of 9).

My idea that extreme values of sulfur dioxide would result in lower quality scores is validated (to some extent) when plotting total sulfur dioxide and alcohol by quality for white wines. For this plot, the highest quality wines tend to have alcohol levels above 10% and total sulfur dioxide levels between 75 and 200 mg / dm^3.

The plots for chlorides vs alcohol for white wine and sulphates vs alcohol for red wine have a very similar shape though the shape is less clear for red wines. This is probably because there is less data for red wines. The greatest variability of sulphates in red wine appears to occur in wines with less than average alcohol content. Though the wines with large amounts of sulphate appear to be lower quality, this relationship is less clear than that of chlorides and quality for white wine.


Final Plots and Summary

Plot One

One of the most interesting concepts pursued during this exploration is the idea that white wines and red wines differ at a much more fundamental level than their colour. To begin with, the reviewers who scored the wines tend to be more critical of red wines than white wines. This plot compares the kernel density estimates for the distribution of quality scores for the two types of wines. This is an effective way of comparing these two distributions as it normalises the differences that are introduced by the disparity in size of the two data sets. Comparing two histograms of raw counts would be fruitless when there are more than twice as many white wines in the data.
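To see why a kernel density estimate removes the effect of sample size, here is a minimal Gaussian KDE in plain Python. It is a sketch under stated assumptions: the function name is hypothetical, the bandwidth of 0.5 is an arbitrary choice rather than the one used for the plot, and the scores are invented. Tripling the sample without changing its shape leaves the estimated density unchanged, whereas raw histogram counts would triple.

```python
from math import exp, pi, sqrt


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of sample at a point x."""
    # Dividing by the sample size is what makes the curve integrate to 1
    # no matter how many observations there are.
    norm = 1.0 / (len(sample) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


# Invented quality scores; `large` has the same shape but three times the size.
small = [5, 5, 6, 6, 6, 7]
large = small * 3
kde_small = gaussian_kde(small, bandwidth=0.5)
kde_large = gaussian_kde(large, bandwidth=0.5)
print(abs(kde_small(6) - kde_large(6)) < 1e-9)
```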

From this plot, we can clearly see that white wines are more likely to receive a score of 6 or higher than reds. White wines are also more likely to receive a very high score where reds are far more likely to receive a score between 5 and 7 with many more receiving 5 than 7. It is hard to see in this plot as the number of wines receiving a score of 9 is very small, but all the wines receiving a quality score of 9 are white wines.

Plot Two

Even though the quality KDE plot, shown previously, indicated that the distributions of quality scores for white and red wine differ subtly, the relationships between alcohol levels and quality for red and white wine are very similar. Although there is a positive correlation between alcohol content and wine quality, this relationship only holds for wines with a quality score greater than or equal to 5. For wines that received a quality score less than or equal to 5, the relationship is less clear. For white wines, there appears to be a negative relationship between alcohol and quality, but the interquartile ranges of these three box plots overlap considerably. The trend for these lowest three quality levels for red wine is even flatter than it is for white wine. The box plots help make it very clear that only white wines received a quality score of 9.

I chose to keep the outliers for this plot. There aren’t many of them and the box plots make them very clear. I also felt it was important to keep the most distant outlier (the red wine with a high alcohol content but a quality score of 5) to show that the wine with the highest alcohol content by volume didn’t receive the highest quality score.

When I first explored the relationship between alcohol and quality for each type of wine, I tried using scatter plots but there was a severe over-plotting problem. The over-plotting obscured the turning point in the trend between the two features. The trend became much clearer when the scatter plot was replaced with a series of box plots.
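The move from a scatter plot to box plots can be sketched in base R as follows. The data is synthetic and the relationship between the two variables is invented purely to give the example a shape; it is not a result from the wine data. The sketch counts how many points share a position (the over-plotting problem) and then computes the five-number summary that each box plot draws.

```r
# Synthetic data only; the relationship below is invented for illustration.
set.seed(2)
n <- 2000
quality <- sample(3:8, n, replace = TRUE)
alcohol <- round(9.5 + 0.6 * pmax(quality - 5, 0) + rnorm(n, sd = 0.8), 1)

# Over-plotting: quality is discrete and alcohol is rounded, so many points
# land on exactly the same coordinates in a scatter plot.
n_distinct <- nrow(unique(data.frame(quality, alcohol)))

# A box plot replaces each column of stacked points with Tukey's five numbers.
box_stats <- sapply(split(alcohol, quality), fivenum)
rownames(box_stats) <- c("min", "lower hinge", "median", "upper hinge", "max")

# Draw the box plots when run interactively.
if (interactive()) {
  boxplot(alcohol ~ quality, xlab = "quality", ylab = "alcohol")
}
```

Here `n_distinct` is far smaller than `n`, so most points in a scatter plot would sit on top of one another, while the medians in `box_stats` show the trend directly.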

Plot Three

NB: This plot excludes outliers that lie more than 3 times the interquartile range (IQR) above or below the median. This has excluded 2 outliers for wine density and 115 outliers for volatile acidity. The plots are clearer without these outliers and the general trend of the plot is unaffected. A higher threshold has been used for this plot than for other plots in the project. A threshold of 2 times the IQR excludes 490 records for volatile acidity, and this did affect how the relationship between high levels of volatile acidity and quality was presented by the plot.
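One reading of this exclusion rule can be sketched in base R as below. The function name `within_iqr_of_median` and the toy vector are hypothetical, and the exact filter used to produce the plot may differ; this only shows how the threshold multiplier changes the number of excluded values.

```r
# Hypothetical helper: TRUE for values within k * IQR of the median.
within_iqr_of_median <- function(x, k = 3) {
  centre <- median(x, na.rm = TRUE)
  spread <- IQR(x, na.rm = TRUE)
  !is.na(x) & abs(x - centre) <= k * spread
}

# Toy values (not wine data): median = 6, IQR = 5.
x <- c(1:9, 20, 100)

keep_3 <- within_iqr_of_median(x, k = 3)  # cut-off at 15 from the median
keep_2 <- within_iqr_of_median(x, k = 2)  # cut-off at 10 from the median

excluded_3 <- sum(!keep_3)  # 1: only 100 is dropped
excluded_2 <- sum(!keep_2)  # 2: both 20 and 100 are dropped

filtered <- x[keep_3]
```

A tighter multiplier can only exclude the same values or more, which is why the choice of threshold can change how the tail of a distribution appears in a plot.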

With this third plot, I wanted to illustrate how white wines and red wines are subtly different in their physical characteristics and how this impacts their quality. The lack of clear relationships in the data made this task very difficult. The fact that the quality of white and red wines is influenced by different physical characteristics makes direct comparison quite challenging. I chose to compare the two features that most influence quality for one type of wine but not for the other. As we discovered earlier, the quality for both types of wine is most influenced by the quantity of alcohol. The next most influential feature for white wine is density and for red wine, it is volatile acidity.

I tried to layer alcohol and quality information into the plot by using those values to control the alpha and size of the points. This proved hard to read, so I changed the quality score to a facet and kept the alcohol value controlling the point size. Moving from a single plot to a faceted plot greatly improved the readability of the plot. Excluding outliers helped further by allowing the data to fill the plot.

The plot shows that red wine tends to have both a higher density and a higher volatile acidity than white wine. As the quality of a wine increases, the volatile acidity decreases. The change in volatile acidity as quality increases is more pronounced for red wine than it is for white wine. The density of white wine also decreases as quality increases, while the density of red wine is fairly consistent across the quality facets.


Reflection

This data appears to validate the notion that wine flavours can be very subtle and that it takes a discerning palate to differentiate between the traits of different wines. Though there are relationships between the different physical characteristics of a wine and its quality, these relationships are quite weak. Even alcohol, which has the strongest relationship with quality, still has a Pearson product-moment correlation coefficient (r) below 0.5 for both white and red wines.
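For readers less familiar with the coefficient, here is a minimal base R sketch of how Pearson's r is computed, both with `cor()` and from its definition. The two variables are synthetic and their relationship is invented; the resulting value says nothing about the wine data.

```r
# Synthetic data only; the slope of 0.4 is an arbitrary choice for illustration.
set.seed(3)
n <- 500
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

# Built-in Pearson correlation.
r_builtin <- cor(x, y, method = "pearson")

# The same quantity from its definition: the sum of cross-products of the
# deviations, divided by the square root of the product of the sums of squares.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Squaring r gives the share of variance in y that a straight line in x accounts for.
r_squared <- r_builtin^2
```

Since r squared is the proportion of variance explained by a linear fit, any r below 0.5 corresponds to less than a quarter of the variance in one variable being linearly accounted for by the other.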

These subtle relationships made this exploration quite challenging. Without clear relationships to latch on to, the exploration proved quite nebulous. The clearest finding from the exploration is that the difference between white and red wines isn’t merely their colour. The characteristics that help differentiate a low-quality wine from a high-quality one are different for red and white wines, though these relationships are weak and might be ignored if the data contained more striking correlations.

Rendering the joint correlations plot was a struggle and took quite a lot of tweaking. I’m not entirely happy with it. Ideally, I would like the classification of the two triangles (red and white) to be clear visually (rather than requiring a textual description) but I was unable to render the distinction as I had wanted. For some reason, the lines I was drawing over the top of the plot were getting clipped and stopped displaying halfway down the last row of the plot. I felt that my attempts looked messy, so I decided not to pursue it further.

Rendering the correlation matrices was the most illuminating part of the exploration process. Seeing the differences between the correlation values for white and red wines was the first place I saw evidence that the physical characteristics of the wines not only differ between red and white wines, but also have differing impacts on the perceived quality of the two types of wine.
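The idea of a joint matrix with one triangle per group can be sketched in base R as below. This is not the code used for the plot in this report: the data frames are synthetic, the helper `make_group` and its coefficients are invented, and only the column names are borrowed from the wine data as labels.

```r
# Synthetic data only; coefficients are invented and imply nothing about real wines.
set.seed(4)

# Hypothetical helper: build a small data frame with three correlated columns.
make_group <- function(n, slope) {
  alcohol <- rnorm(n)
  density <- -0.5 * alcohol + rnorm(n)
  quality <- slope * alcohol + rnorm(n)
  data.frame(alcohol, density, quality)
}

group_a <- make_group(500, slope = 0.5)  # stands in for one type of wine
group_b <- make_group(300, slope = 0.2)  # stands in for the other type

cor_a <- cor(group_a)
cor_b <- cor(group_b)

# Joint matrix: upper triangle from group A, lower triangle from group B.
joint <- cor_a
joint[lower.tri(joint)] <- cor_b[lower.tri(cor_b)]
diag(joint) <- NA  # the diagonal is always 1, so it carries no information

# Cell-by-cell difference between the two groups' correlations.
cor_difference <- cor_a - cor_b
```

Because both matrices share the same row and column order, the cell at `[i, j]` in the upper triangle and the cell at `[j, i]` in the lower triangle describe the same pair of features for the two groups.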

I would be interested in performing further analysis with wine data that included other data that would likely influence perceived quality such as perceived flavour (sweet, dry, etc.), colour, context, the shape of the glass, or whether the wine was perceived to be cheap or expensive. It would also be interesting to know if individual reviewers reviewed multiple wines and whether there were trends in their reviews. For example, were individual reviewers more or less sensitive to certain physical features of the wine?

Resources